CFAES Bioinformatics Core, Ohio State University
2026-01-29
High-throughput sequencing (HTS)
Sequences 10⁵-10⁹ DNA fragments (“reads”), usually randomly selected, at a time. Two types:
Modified after Pereira, Oliveira, and Sousa (2020)
Sequences a single, typically PCR-amplified, short-ish (≤900 bp) DNA fragment at a time.
Sequencing is performed by synthesizing a new DNA strand with fluorescently-labeled nucleotides, using a different color for each base (A, C, G, T).
The final result is a chromatogram that can be “base-called”:
The entire human genome (3 Gbp) was sequenced with Sanger technology!
Anyone want to guess how much this may have cost?
https://www.genome.gov/about-genomics/fact-sheets/Sequencing-Human-Genome-cost
With HTS, DNA can be sequenced much more efficiently and cheaply, and Sanger sequencing has become less widely used.
But it is not obsolete, in part because high throughput isn’t always needed –
some present-day uses of Sanger sequencing:
Examining variation among individuals or populations in one or more candidate or marker genes (for population genetics, phylogenetics, functional inferences, etc.)
Taxonomic identification of samples
Looking at the bigger picture first, HTS produces data that underlies several of these main “omics” approaches:
Copyright ThermoFisher
The “omics” suffix indicates the involvement of large-scale datasets — in the sense that, for example, “genomics” data typically spans much or all of the genome.
While the boundaries can be fuzzy, sequencing a single gene in a single organism is not genomics, and running qPCR for a handful of genes is not transcriptomics.

| Omics type | Molecule type | Data mainly produced by |
|---|---|---|
| Genomics | DNA | High-throughput sequencing (HTS) |
| Epigenomics | DNA modifications | High-throughput sequencing (HTS) |
| Transcriptomics | RNA | High-throughput sequencing (HTS) |
| Proteomics | Proteins | Mass spectrometry |
| Metabolomics | Metabolites | Mass spectrometry |
[ILLUSTRATION OF A TANGLED MASS OF READS - Recall that “reads” are sequenced fragments of DNA]
Assemble (“build back”) into a single sequence that can be used e.g. as a reference
Compare specific sequence variants across multiple samples
Count the number of reads originating from distinct units, such as genes (RNA-Seq) or organisms (microbial community characterization)
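As a rough illustration of the counting use case, the sketch below tallies reads per unit (gene or organism). It assumes each read has already been assigned to a unit by some earlier step; the function name and the example labels are hypothetical, and real workflows rely on dedicated tools for this.

```python
from collections import Counter

def count_reads_per_unit(assignments):
    """Tally how many reads were assigned to each unit (e.g. gene or organism).

    `assignments` holds one label per read; None marks an unassigned read.
    """
    return Counter(unit for unit in assignments if unit is not None)

if __name__ == "__main__":
    # Hypothetical per-read assignments for five reads
    reads = ["geneA", "geneB", "geneA", None, "geneA"]
    for unit, n in count_reads_per_unit(reads).most_common():
        print(unit, n)
```

The resulting table of counts per unit and per sample is the kind of data that downstream analyses (such as expression comparisons) start from.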
HTS read lengths vary from 300 bp and shorter (short-read HTS) up to tens of thousands of base pairs (long-read HTS).
For example:
Currently, no sequencing technology is error-free: the sequenced read may differ from the actual DNA sequence it came from.
When you receive HTS reads, base calls have typically been made already.
Every base call is accompanied by a quality score, representing the estimated error probability.
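Quality scores are conventionally expressed on the Phred scale, where Q = -10 × log10(P) and P is the estimated error probability. The sketch below converts between the two and decodes a quality string, assuming the Phred+33 ASCII encoding used in standard FASTQ files (the encoding is an assumption here, as are the function names).

```python
import math

def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score Q to an estimated error probability."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p: float) -> float:
    """Convert an estimated error probability to a Phred quality score."""
    return -10 * math.log10(p)

def decode_quality_string(qual: str, offset: int = 33) -> list[int]:
    """Decode an ASCII-encoded quality string (assumed Phred+33) into scores."""
    return [ord(ch) - offset for ch in qual]

if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(f"Q{q}: estimated error probability {phred_to_error_prob(q):g}")
    print(decode_quality_string("II5#"))
```

On this scale, Q20 corresponds to an estimated 1 in 100 chance that the base call is wrong, and Q30 to 1 in 1,000.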
To overcome sequencing errors, every base can be sequenced multiple times –
i.e., obtaining a “depth of coverage” greater than 1:
Typical depths of coverage are ~50-100x for genome assembly and 10-30x for “resequencing” (!)
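Mean depth of coverage can be estimated as (number of reads × read length) / genome size. The sketch below applies this to a 3 Gbp genome (the human genome size mentioned earlier); the 150 bp read length is an assumed, illustrative value, and the function names are hypothetical.

```python
import math

def mean_depth(n_reads: int, read_length: int, genome_size: int) -> float:
    """Estimated mean depth of coverage: total sequenced bases / genome size."""
    return n_reads * read_length / genome_size

def reads_needed(target_depth: float, read_length: int, genome_size: int) -> int:
    """Number of reads needed to reach a target mean depth of coverage."""
    return math.ceil(target_depth * genome_size / read_length)

if __name__ == "__main__":
    genome_size = 3_000_000_000  # 3 Gbp
    read_length = 150            # assumed read length in bp
    for depth in (10, 30, 50, 100):
        n = reads_needed(depth, read_length, genome_size)
        print(f"{depth}x coverage needs about {n:,} reads of {read_length} bp")
```

This is an average only: actual depth varies along the genome, and with paired-end data each read of a pair counts as a separate read in this calculation.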
| | Short-read HTS | Long-read HTS |
|---|---|---|
| Usage | More | Less (but increasing) |
| Main companies | Illumina | Oxford Nanopore Technologies (ONT) & Pacific Biosciences (PacBio) |
| Timeline | Since 2005 — technology fairly stable | Since 2011 — still rapid development |
| Read lengths | 50-300 bp | 10-100+ kbp |
| Error rates | Mostly <0.1% | 1-10% (ONT) / <0.1-10% (PacBio) |
| Throughput | Higher | Lower |
| Cost per base | Lower | Higher |
| AKA | Next-Generation Sequencing (NGS) | Third-generation sequencing |
100-300 bp reads with 0.1-0.2% error rates
More reads, lower per-base cost, and generally lower error rates than long-read sequencing.
In an HTS context, a “library” is a collection of DNA fragments ready for sequencing.
After library prep, each DNA fragment is flanked by several types of short sequences that together make up the “adapters”:
Multiplexing!
Adapters can include so-called “indices” or “barcodes” that identify individual samples. That way, up to 96 samples can be combined (multiplexed) into a single library,
i.e. into a single tube.
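After sequencing, multiplexed reads are separated again by sample based on their index sequences ("demultiplexing"). The sketch below shows the basic idea with exact index matching. The index sequences, sample names, and function name are hypothetical, and in practice this step is usually done by dedicated software that also tolerates index mismatches.

```python
from collections import defaultdict

def demultiplex(reads, index_to_sample):
    """Group reads by sample using each read's index (barcode) sequence.

    `reads` is an iterable of (index_sequence, read_sequence) tuples.
    Reads whose index matches no known sample go to "undetermined".
    """
    by_sample = defaultdict(list)
    for index_seq, read_seq in reads:
        sample = index_to_sample.get(index_seq, "undetermined")
        by_sample[sample].append(read_seq)
    return dict(by_sample)

if __name__ == "__main__":
    # Hypothetical indices and reads, for illustration only
    index_to_sample = {"ACGTAC": "sample_A", "TGCATG": "sample_B"}
    reads = [
        ("ACGTAC", "GATTACA"),
        ("TGCATG", "CCGGAA"),
        ("ACGTAC", "TTTTGG"),
        ("NNNNNN", "ACACAC"),  # unreadable index
    ]
    for sample, seqs in demultiplex(reads, index_to_sample).items():
        print(sample, seqs)
```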
DNA fragments can be sequenced from both ends as shown below —
this is called “paired-end” (PE) sequencing:
When sequencing is instead single-end (SE), no reverse read is produced:
The insert size can vary – by design, but also because of limited precision in size selection. In some cases, it is:
First, library fragments bind to a surface thanks to the adapters, and the DNA templates are then PCR-amplified to form “clusters” of identical fragments:
In the diagram above, for illustrative purposes:
Then, sequencing is performed by synthesizing a new strand using fluorescently-labeled bases and taking a picture each time a new nucleotide is incorporated:
This error profile is why, for Illumina:
The technologies underlying the two main long-read HTS technologies are very different, but have some commonalities beyond long reads — they:
Error rates are changing
I mentioned earlier that long-read HTS has a higher error rate than short-read (Illumina) HTS.
However, error rates in one type of PacBio sequencing where individual fragments are sequenced multiple times (“HiFi”) are now lower than in Illumina.
A single strand of DNA passes through a nanopore —
the electrical current is measured, which depends on the combination of bases passing through the pore:
Under development!
ONT constantly releases new flow cells with updated technology, which have led to large decreases in error rates over the past decade — and even over the past two or so years.
Many HTS applications either require a “reference genome” or involve its production. What exactly does reference genome refer to? It usually includes:
An assembly
A representation of most or all of the genome DNA sequence: the genome assembly
An annotation
Provides e.g. locations of genes and other genomic “features” in the corresponding genome assembly, and functional information for these features
Taxonomic identity
Reference genomes are typically applicable at the species level. For example, if you work with maize, you want a Zea mays reference genome. But:
https://en.wikipedia.org/wiki/Karyotype
Key features:
Konkel and Slot (2023)
With increasing usage & quality of long-read HTS, assemblies are getting better and better
For chromosome-level assemblies, i.e. with one contiguous sequence for each chromosome, technologies beyond standard sequencing are often needed (e.g. Hi-C, optical mapping)
Many assemblies are not “chromosome-level”, but consist of –often 1000s of– fragments (contigs and scaffolds). Even chromosome-level assemblies are not 100% complete.
Contigs are contiguous, known stretches of DNA created by the assembly process, basically by overlapping reads.
Often, the order and orientation of two or more contigs is known, but there is a gap of unknown size between them. Such contigs are connected into scaffolds with a stretch of Ns in between.
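To make the contig/scaffold relationship concrete, the sketch below splits a scaffold sequence back into its contigs at runs of Ns and reports the gap lengths. The toy sequence and function names are made up for illustration; since the true gap size is unknown, the number of Ns in a real assembly may be a placeholder rather than a measurement.

```python
import re

def split_scaffold(scaffold: str) -> list[str]:
    """Split a scaffold into its contigs at runs of N (gap) characters."""
    return [contig for contig in re.split(r"[Nn]+", scaffold) if contig]

def gap_lengths(scaffold: str) -> list[int]:
    """Lengths of the N runs (gaps) in a scaffold, in order of appearance."""
    return [len(match.group()) for match in re.finditer(r"[Nn]+", scaffold)]

if __name__ == "__main__":
    # Toy scaffold: three contigs joined by two gaps
    scaffold = "ACGTNNNNNGGCCANNTT"
    print(split_scaffold(scaffold))
    print(gap_lengths(scaffold))
```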
How is this data stored?
Both genome assemblies and annotations are typically saved in a single text file each — we’ll explore some of these files in tomorrow’s lab.
TBA
The labs this and next week are organized around the data set from Garrigós et al. (2025):
This paper uses paired-end Illumina RNA-Seq data to study gene expression in Culex pipiens mosquitos infected with two different malaria-causing Plasmodium protozoans.